Handling and analysis of the Enron Email Dataset - Part 1


The class definitions

  • EnronEmailParser class
    • Parser for the emails included in the Enron Email Dataset.
    • This particular implementation treats all recipients (to, cc and bcc) as the same type.
  • EnronEmailDataset class
    • Data handler for the Enron Email Dataset
    • It relies on the EnronEmailParser class to do the actual email parsing.
    • It uses pandas dataframes as the data storage objects.
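
The source of EnronEmailParser is not reproduced in this notebook, so the following is only a minimal sketch of how the recipient handling described above could look, using Python's standard email package. The function name parse_recipients and the sample message are hypothetical and not taken from enrondatahandling. The sketch keeps a to/cc/bcc label per address, similar to the type column of the recipients table shown further down, while counting all three kinds as recipients.

```python
from email import message_from_string
from email.utils import getaddresses


def parse_recipients(raw_email):
    """Return (sender, [(address, type), ...]) for one raw email string.

    Hypothetical helper for illustration only; not the EnronEmailParser code.
    """
    msg = message_from_string(raw_email)

    # Extract the bare sender address from the From header, if present
    from_pairs = getaddresses([msg.get('From', '')])
    sender = from_pairs[0][1].lower() if from_pairs and from_pairs[0][1] else None

    # Collect every address on the To, Cc and Bcc headers as a recipient
    recipients = []
    for header in ('To', 'Cc', 'Bcc'):
        for _name, address in getaddresses(msg.get_all(header, [])):
            if address:
                recipients.append((address.lower(), header.lower()))
    return sender, recipients


# Made-up message used only to exercise the helper
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Cc: dave@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)
sender, recipients = parse_recipients(sample)
```

With this input, sender is the address from the From header and recipients holds three (address, type) pairs, two of type to and one of type cc.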

In [1]:
from IPython.display import display
import pandas as pd
from enrondatahandling import EnronEmailDataset

Basic Setup

Having defined the basic classes that will handle the data and parsing for us, we can now start to load and parse our data. The two main tables, aka dataframes, are shown below (limited to the top 5 rows in each case).


In [2]:
# Load and parse the enron email dataset
enronData = EnronEmailDataset('./data')


Surveyed 1702 email files
Parsed 1702 emails
Found 83 responses

In [3]:
# Let's take a look at the emails table
enronData.emails.head()


Out[3]:
email_id datetime ts tz sender num_tos num_ccs num_bccs num_recipients subject num_lines_in_msg
email_id
./data/4/54650.txt ./data/4/54650.txt 2001-06-28 04:04:57-07:00 993726297 tzoffset(u'PDT', -25200) j.kaminski@enron.com 1 0 0 1 RE: Thu evening 78
./data/6/173776.txt ./data/6/173776.txt 2000-07-18 06:49:00-07:00 963928140 tzoffset(u'PDT', -25200) steven.kean@enron.com 1 0 0 1 Re: Price Cap Media--DRAFT 81
./data/1/138102.txt ./data/1/138102.txt 2001-11-14 08:35:46-08:00 1005755746 tzoffset(u'PST', -28800) john.shelk@enron.com 1 1 0 2 RE: Dynegy/Enron Point of Contact 51
./data/1/173413.txt ./data/1/173413.txt 2000-02-20 09:53:00-08:00 951069180 tzoffset(u'PST', -28800) steven.kean@enron.com 1 2 0 3 Re: Trade Mission 315
./data/1/219048.txt ./data/1/219048.txt 2001-08-10 15:40:25-07:00 997483225 tzoffset(u'PDT', -25200) ray.alvarez@enron.com 2 2 0 4 CONFIDENTIAL Attached file 15

In [4]:
# The recipients table is kept separate (one row per email/recipient pair) so that no dataframe cell has to hold a list
enronData.recipients.head()


Out[4]:
email_id recipient type
0 ./data/1/10425.txt kenneth.lay@enron.com to
1 ./data/1/10425.txt mark.frevert@enron.com to
2 ./data/1/10425.txt jeff.skilling@enron.com to
3 ./data/1/10425.txt mark.schroeder@enron.com to
4 ./data/1/10425.txt joseph.sutton@enron.com to
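
To illustrate why a separate, long-format recipients table is convenient, here is a small sketch that flattens per-email recipient lists into one row per (email, recipient) pair. The parsed list and its keys are made-up stand-ins for whatever the parser returns; only the column names email_id, recipient and type come from the table above.

```python
import pandas as pd

# Hypothetical parsed output: one dict per email, recipients held as lists
parsed = [
    {'email_id': 'a.txt',
     'to': ['x@example.com', 'y@example.com'],
     'cc': ['z@example.com'],
     'bcc': []},
    {'email_id': 'b.txt',
     'to': ['x@example.com'],
     'cc': [],
     'bcc': []},
]

# One row per (email, recipient) pair, so no cell holds a list
rows = [
    {'email_id': email['email_id'], 'recipient': address, 'type': kind}
    for email in parsed
    for kind in ('to', 'cc', 'bcc')
    for address in email[kind]
]
recipients = pd.DataFrame(rows, columns=['email_id', 'recipient', 'type'])

# Per-email recipient counts fall out of a simple groupby
num_recipients = recipients.groupby('email_id').size()
```

In this layout, questions such as "who received the most emails" reduce to ordinary groupby or merge operations rather than loops over list-valued cells.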

Basic analysis

Let's now run some basic analysis to see what insights we can get out of this data.

Note: In both questions below, "recipients" means everyone on the to, cc and bcc lists.

Question 1

In the next couple of cells I am trying to answer the following question:

Let's label an email as "direct" if there is exactly one recipient and "broadcast" if it has multiple recipients. Identify the top 3 people who received the largest number of direct emails and the person (or people) who sent the largest number of broadcast emails.


In [5]:
# Keep only the recipient rows that belong to emails with exactly one recipient
directs = pd.merge(
    enronData.recipients,
    enronData.emails[enronData.emails['num_recipients'] == 1],
    left_on='email_id',
    right_index=True)[['ts', 'recipient']]
# Count the direct emails per recipient, largest count first
directs = (
    directs.groupby('recipient')
    .count()
    .rename(columns={'ts': 'count_direct'})
    .sort_values(by='count_direct', ascending=False))
directs.head()


Out[5]:
count_direct
recipient
maureen.mcvicker@enron.com 115
vkaminski@aol.com 43
jeff.dasovich@enron.com 25
richard.shapiro@enron.com 23
elizabeth.linnell@enron.com 18

In [6]:
# Keep only the emails with more than one recipient
broadcasts = enronData.emails[enronData.emails['num_recipients'] > 1][['sender', 'ts']]
# Count the broadcast emails per sender, largest count first
broadcasts = (
    broadcasts.groupby('sender')
    .count()
    .rename(columns={'ts': 'count_broadcast'})
    .sort_values(by='count_broadcast', ascending=False))
broadcasts.head()


Out[6]:
count_broadcast
sender
steven.kean@enron.com 252
john.shelk@enron.com 83
j.kaminski@enron.com 40
miyung.buster@enron.com 31
alan.comnes@enron.com 19
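
As a sanity check, the same direct/broadcast counting can be tried on a tiny hand-made example where the answer is obvious. The sketch below uses value_counts instead of the groupby/count chain above; the two toy dataframes only mimic the columns used here (num_recipients, sender, email_id, recipient) and all addresses are made up.

```python
import pandas as pd

# Toy emails table: e1 and e4 are direct, e2 and e3 are broadcast
emails = pd.DataFrame(
    {'sender': ['s1@example.com', 's2@example.com',
                's2@example.com', 's1@example.com'],
     'num_recipients': [1, 2, 3, 1]},
    index=['e1.txt', 'e2.txt', 'e3.txt', 'e4.txt'])

# Toy recipients table: one row per (email, recipient) pair
recipients = pd.DataFrame({
    'email_id': ['e1.txt', 'e2.txt', 'e2.txt',
                 'e3.txt', 'e3.txt', 'e3.txt', 'e4.txt'],
    'recipient': ['r1@example.com', 'r1@example.com', 'r2@example.com',
                  'r1@example.com', 'r2@example.com', 'r3@example.com',
                  'r1@example.com']})

# Direct emails: exactly one recipient, counted per recipient
direct_ids = emails.index[emails['num_recipients'] == 1]
is_direct = recipients['email_id'].isin(direct_ids)
direct_counts = recipients.loc[is_direct, 'recipient'].value_counts()

# Broadcast emails: more than one recipient, counted per sender
is_broadcast = emails['num_recipients'] > 1
broadcast_counts = emails.loc[is_broadcast, 'sender'].value_counts()
```

Here r1@example.com is the only recipient of direct emails (two of them) and s2@example.com is the only sender of broadcasts (also two), which matches what one would count by hand.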

Answer 1

Based on the outputs above, we can say:

  • The top three people who received the largest number of direct emails are:
    1. Maureen McVicker (maureen.mcvicker@enron.com)
    2. V Kaminski (vkaminski@aol.com)
    3. Jeff Dasovich (jeff.dasovich@enron.com)
  • The person who sent the largest number of broadcast emails is Steven Kean (steven.kean@enron.com)

Question 2

In this section I am trying to answer the following question:

Find the five emails with the fastest response times. Please include file IDs, subject, sender, recipient, and response times. (A response is defined as a message from one of the recipients to the original sender whose subject line contains all of the words from the subject of the original email, and the response time should be measured as the difference between when the original email was sent and when the response was sent.)
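
The response detection itself happens inside EnronEmailDataset and is not shown in this notebook. As a rough sketch of the definition above, and not the actual implementation, a pairwise check could look like the following. The dict layout, the helper names subject_words and is_response, and the extra requirement that the response is sent after the original are my assumptions.

```python
import re
from datetime import datetime


def subject_words(subject):
    """Lower-cased set of the words in a subject line."""
    return set(re.findall(r"\w+", subject.lower()))


def is_response(original, candidate):
    """True if candidate matches the response definition for original.

    Hypothetical helper; both arguments are dicts with the keys
    'sender', 'recipients' (a set), 'subject' and 'datetime'.
    """
    return (
        # The response comes from one of the original recipients ...
        candidate['sender'] in original['recipients']
        # ... and goes back to the original sender
        and original['sender'] in candidate['recipients']
        # Its subject contains all of the words of the original subject
        and subject_words(original['subject']) <= subject_words(candidate['subject'])
        # Assumption: a response must be sent after the original
        and candidate['datetime'] > original['datetime']
    )


# Made-up emails used only to exercise the check
original = {'sender': 'a@example.com',
            'recipients': {'b@example.com', 'c@example.com'},
            'subject': 'Budget review',
            'datetime': datetime(2001, 1, 1, 9, 0, 0)}
reply = {'sender': 'b@example.com',
         'recipients': {'a@example.com'},
         'subject': 'RE: Budget review',
         'datetime': datetime(2001, 1, 1, 9, 4, 0)}
unrelated = {'sender': 'c@example.com',
             'recipients': {'a@example.com'},
             'subject': 'Lunch',
             'datetime': datetime(2001, 1, 1, 9, 5, 0)}

# Response time in seconds, as the difference between the two send times
response_time = (reply['datetime'] - original['datetime']).total_seconds()
```

Two caveats with this sketch: an original with an empty subject would match any later message between the same people, and the real emails carry different time zones, so the difference should be taken on timezone-aware datetimes or on epoch timestamps such as the ts column.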


In [7]:
responses = enronData.responses.sort_values(by='response_time_in_secs').reset_index()
responses = responses[[
        'email_id', 
        'sender',
        'subject', 
        'email_id_response', 
        'sender_response',
        'subject_response', 
        'response_time_in_secs']]
responses.head()


Out[7]:
email_id sender subject email_id_response sender_response subject_response response_time_in_secs
0 ./data/1/139495.txt rod.hayslett@enron.com FW: Confidential - GSS Organization Value to ETS ./data/1/151121.txt stanley.horton@enron.com FW: Confidential - GSS Organization Value to ETS 148
1 ./data/1/228996.txt michelle.cash@enron.com RE: CONFIDENTIAL Personnel issue ./data/4/228911.txt lizzette.palmer@enron.com RE: CONFIDENTIAL Personnel issue 236
2 ./data/4/122923.txt paul.kaufman@enron.com RE: Eeegads... ./data/3/122926.txt jeff.dasovich@enron.com RE: Eeegads... 240
3 ./data/1/121747.txt karen.denne@enron.com Re: CONFIDENTIAL - Residential in CA ./data/3/121748.txt jeff.dasovich@enron.com Re: CONFIDENTIAL - Residential in CA 240
4 ./data/1/201878.txt m..tholt@enron.com FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C... ./data/4/200845.txt stephanie.miller@enron.com FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C... 262

Answer 2

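Based on the output above, the five emails with the fastest response times are, in order:

  1. ./data/1/139495.txt sent by rod.hayslett@enron.com regarding "FW: Confidential - GSS Organization Value to ETS", answered by stanley.horton@enron.com (./data/1/151121.txt) after 148 seconds
  2. ./data/1/228996.txt sent by michelle.cash@enron.com regarding "RE: CONFIDENTIAL Personnel issue", answered by lizzette.palmer@enron.com (./data/4/228911.txt) after 236 seconds
  3. ./data/4/122923.txt sent by paul.kaufman@enron.com regarding "RE: Eeegads...", answered by jeff.dasovich@enron.com (./data/3/122926.txt) after 240 seconds
  4. ./data/1/121747.txt sent by karen.denne@enron.com regarding "Re: CONFIDENTIAL - Residential in CA", answered by jeff.dasovich@enron.com (./data/3/121748.txt) after 240 seconds
  5. ./data/1/201878.txt sent by m..tholt@enron.com regarding "FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C...", answered by stephanie.miller@enron.com (./data/4/200845.txt) after 262 seconds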
